PM2.5 refers to fine particulate matter: particles that are \(2.5\) micrometers or less in diameter. To compute PM2.5, quarterly means were extracted from the air monitors for \(2012\) to \(2014\), and these were averaged into a single mean over that period. A spatial model then estimated PM2.5 for each census tract within fifty kilometers of an air monitoring station. For tracts farther than fifty kilometers from a station, satellite observations were used to assign PM2.5.
The regions near Oakland and Berkeley appear to experience a relatively high concentration of PM2.5. Places near San Jose and Sonoma, by comparison, experience a more moderate concentration.
Note that “Asthma” is the # of Emergency Visits per \(10000\) People for Asthma between \(2011\) and \(2013\).
It looks like regions near Vallejo and Alameda experience a relatively large # of Emergency Visits per \(10000\) People for Asthma. On the other hand, places near San Jose and Palo Alto experience a noticeably smaller # of Emergency Visits per \(10000\) People for Asthma.
## `geom_smooth()` using formula 'y ~ x'
From the scatter plot, it appears there is a positive correlation between PM2.5 and Asthma. The best-fit line agrees with this observation, since it has a positive slope. Moreover, there appears to be more variation in the # of Emergency Visits per \(10000\) People for Asthma when PM2.5 is between roughly \(8\) and \(9\).
##
## Call:
## lm(formula = Asthma ~ PM2.5, data = MergedData)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -54.47 -25.89  -9.61  12.94 182.95
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -116.278     13.040  -8.917   <2e-16 ***
## PM2.5         19.862      1.534  12.950   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.49 on 1578 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.09606, Adjusted R-squared: 0.09549
## F-statistic: 167.7 on 1 and 1578 DF, p-value: < 2.2e-16
We can interpret the coefficient for PM2.5 as follows: a one-unit increase in PM2.5 is associated with an average increase of \(19.862\) in the # of Emergency Visits per \(10000\) People for Asthma. Moreover, we can use the Standard Error to construct a \(95\%\) CI for \(\beta_{\text{PM2.5}}\). The interpretation is that if we repeatedly drew new samples and built the interval the same way each time, the intervals would cover the true parameter \(\beta_{\text{PM2.5}}\) about \(95\%\) of the time. We can also interpret the \(R^2\): \(9.606\%\) of the variation in the # of Emergency Visits per \(10000\) People for Asthma is explained by the linear model in PM2.5.
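As a rough worked example of the interval mentioned above, assuming the usual normal approximation with critical value \(1.96\) (an assumption here; with \(1578\) degrees of freedom the exact \(t\) quantile is nearly identical), the reported estimate and Standard Error give approximately
\[\hat{\beta}_{\text{PM2.5}} \pm 1.96 \cdot \text{SE} = 19.862 \pm 1.96 \cdot 1.534 \approx (16.86,\ 22.87).\]
Since this is a regression with a single predictor, \(R^2\) is also the squared sample correlation, so the correlation between PM2.5 and Asthma is roughly \(\sqrt{0.09606} \approx 0.31\), with the positive sign taken from the slope.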
First, let \(\epsilon\) denote the errors of this particular model, which we assess through the residuals. A common assumption of linear regression is that \(\epsilon \sim (0, \sigma^2 I_n)\), that is, the errors have mean \(0\) and covariance \(\sigma^2 I_n\), where \(I_n\) is the \(n \times n\) identity matrix. A problem with this plot is that the bulk of the residuals does not appear to be centered at \(0\) (the median residual is \(-9.61\)). Moreover, the distribution is slightly skewed to the right, whereas a more desirable outcome would be a more symmetric distribution. Hence, we perform a log transformation of Asthma below.
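The effect of a log transformation on the residuals can be illustrated with a small sketch. The data below are simulated (hypothetical values, not the CalEnviroScreen data); only the column names `PM2.5`, `Asthma`, and `LogAsthma` mirror the models in this report, and `resid_summary` is a helper introduced just for this example. Least squares residuals from a model with an intercept always average to \(0\), so the median and the skewness are the more informative checks of centering and symmetry.

```r
# Simulated stand-in for MergedData (hypothetical values, not the real data).
set.seed(1)
n <- 500
PM2.5 <- runif(n, min = 6, max = 11)
# Multiplicative noise makes the response right-skewed on the raw scale.
Asthma <- exp(0.7 + 0.35 * PM2.5 + rnorm(n, sd = 0.65))
sim <- data.frame(PM2.5 = PM2.5, Asthma = Asthma, LogAsthma = log(Asthma))

raw_fit <- lm(Asthma ~ PM2.5, data = sim)
log_fit <- lm(LogAsthma ~ PM2.5, data = sim)

# Summarise where the residuals sit and how lopsided they are.
resid_summary <- function(fit) {
  r <- resid(fit)
  c(mean = mean(r),
    median = median(r),
    skewness = mean((r - mean(r))^3) / sd(r)^3)
}

raw_stats <- resid_summary(raw_fit)
log_stats <- resid_summary(log_fit)
print(round(rbind(raw = raw_stats, log = log_stats), 3))
```

In a setting like this one, the raw-scale residuals would be expected to show a median below \(0\) and a clearly positive skewness, while the log-scale residuals should look closer to symmetric.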
## `geom_smooth()` using formula 'y ~ x'
From the scatter plot, it appears there is still a positive correlation between PM2.5 and Log Asthma. Moreover, the variance appears to be more stable after the log transformation than before.
##
## Call:
## lm(formula = LogAsthma ~ PM2.5, data = MergedData)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.00402 -0.46479  0.03313  0.42298  1.75525
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.69234    0.22840   3.031  0.00248 **
## PM2.5        0.35633    0.02686  13.264  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6566 on 1578 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1003, Adjusted R-squared: 0.09974
## F-statistic: 175.9 on 1 and 1578 DF, p-value: < 2.2e-16
We can interpret the coefficient for PM2.5 as follows: a one-unit increase in PM2.5 is associated with the # of Emergency Visits per \(10000\) People for Asthma being multiplied by \(\exp(0.35633) \approx 1.43\), that is, an increase of roughly \(43\%\). We can also interpret the \(R^2\): \(10.03\%\) of the variation in the Log # of Emergency Visits per \(10000\) People for Asthma is explained by the linear model in PM2.5.
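To see where the multiplicative interpretation comes from, here is a short derivation, assuming the response was transformed with the natural logarithm (consistent with the use of \(\exp\) above). Writing the fitted model as \(\log \hat{Y}(x) = \hat{\beta}_0 + \hat{\beta}_{\text{PM2.5}}\, x\), increasing \(x\) by one unit gives
\[\frac{\hat{Y}(x+1)}{\hat{Y}(x)} = \frac{\exp\big(\hat{\beta}_0 + \hat{\beta}_{\text{PM2.5}} (x+1)\big)}{\exp\big(\hat{\beta}_0 + \hat{\beta}_{\text{PM2.5}}\, x\big)} = \exp\big(\hat{\beta}_{\text{PM2.5}}\big) = \exp(0.35633) \approx 1.43.\]
The ratio does not depend on \(x\), so the same multiplicative factor applies at every PM2.5 level.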
It now appears that the residuals are relatively close to being centered at \(0\). Moreover, they have a more symmetric distribution, which is what we wanted. Now, we plot the map of the residuals.
## [1] "The Census Tract of Interest is: 6085513000"
If we look in the CalEnviroScreen data, the approximate location of Census Tract \(6085513000\) is Stanford (in Santa Clara County). Note that Stanford has a relatively high PM2.5. Now, let \(Y \in \mathbb{R}^n\) denote our observed response (i.e. Log Asthma) and \(\hat{f}(X) \in \mathbb{R}^n\) denote the fitted values from the log-transformed model. Then, a negative residual means that
\[Y_{i} - \hat{f}(X_i) < 0 \implies Y_{i} < \hat{f}(X_i).\] Hence, a negative residual implies we are over-estimating. There could be many reasons why we over-estimate. First, it may be the model itself: we cannot be sure that the linear model is the “best” model to choose. We could fit other statistical models, such as polynomial regression, and run further diagnostics by measuring each model’s performance using a training and test set. Second, we could be over-estimating because some points have high “leverage” over the fit. For instance, if there are outliers, they may be pulling the least squares line in a certain direction.
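The two diagnostics suggested above can be sketched in base R. This is only an illustration on simulated data (hypothetical values, not the CalEnviroScreen data): the column names mirror the model in this report, while the \(75/25\) split, the quadratic alternative, the helper `rmse`, and the cutoffs for leverage and Cook's distance are assumptions chosen for the example, not choices made in the analysis.

```r
# Simulated stand-in for MergedData (hypothetical values, not the real data).
set.seed(2)
n <- 400
sim <- data.frame(PM2.5 = runif(n, min = 6, max = 11))
sim$LogAsthma <- 0.7 + 0.35 * sim$PM2.5 + rnorm(n, sd = 0.65)

# Hold out a quarter of the rows as a test set.
test_idx <- sample(n, size = n / 4)
train <- sim[-test_idx, ]
test <- sim[test_idx, ]

linear_fit <- lm(LogAsthma ~ PM2.5, data = train)
quad_fit <- lm(LogAsthma ~ poly(PM2.5, 2), data = train)

# Root mean squared prediction error on rows the model has not seen.
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$LogAsthma - predict(fit, newdata = newdata))^2))
}
test_rmse <- c(linear = rmse(linear_fit, test),
               quadratic = rmse(quad_fit, test))
print(round(test_rmse, 3))

# Leverage (hat values) and Cook's distance flag points that pull on the line.
lev <- hatvalues(linear_fit)
cook <- cooks.distance(linear_fit)
high_leverage <- which(lev > 2 * mean(lev))   # common rule of thumb
influential <- which(cook > 4 / nrow(train))  # common rule of thumb
cat("high-leverage rows:", length(high_leverage), "\n")
cat("influential rows:", length(influential), "\n")
```

A clearly lower test error for the quadratic fit would suggest the straight line is too rigid, and rows flagged by both rules of thumb would be natural candidates to inspect before trusting a large negative residual.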